arXiv:1506.02494v3 [stat.ME] 18 Nov 2015 


backShift: Learning causal cyclic graphs from 
unknown shift interventions 


Dominik Rothenhausler* 

Seminar fur Statistik 
ETH Zurich, Switzerland 

rothenhaeusler@stat.math.ethz.ch 


Christina Heinze* 

Seminar fiir Statistik 
ETH Zurich, Switzerland 

heinze@stat.math.ethz.ch 


Jonas Peters 

Max Planck Institute for Intelligent Systems 
Tubingen, Germany 

jonas.peters@tuebingen.mpg.de 


Nicolai Meinshausen 

Seminar fiir Statistik 
ETH Zurich, Switzerland 

meinshausen@stat.math.ethz.ch 


Abstract 

We propose a simple method to learn linear causal cyclic models in the presence 
of latent variables. The method relies on equilibrium data of the model recorded 
under a specific kind of interventions (“shift interventions”). The location and 
strength of these interventions do not have to be known and can be estimated from 
the data. Our method, called BACKSHIFT, only uses second moments of the data 
and performs simple joint matrix diagonalization, applied to differences between 
covariance matrices. We give a sufficient and necessary condition for identifi- 
ability of the system, which is fulfilled almost surely under some quite general 
assumptions if and only if there are at least three distinct experimental settings, 
one of which can be pure observational data. We demonstrate the performance on 
some simulated data and applications in flow cytometry and financial time series. 


1 Introduction 

Discovering causal effects is a fundamentally important yet very challenging task in various disci¬ 
plines, from public health research and sociological studies, economics to many applications in the 
life sciences. There has been much progress on learning acyclic graphs in the context of structural 
equation models Ul, including methods that learn from observational data alone under a faithful¬ 
ness assumption E] 13 |4] |5l, exploiting non-Gaussianity of the data E |3 or non-linearities isi. 
Eeedbacks are prevalent in most applications, and we are interested in the setting of El, where we 
observe the equilibrium data of a model that is characterized by a set of linear relations 

X = Bx + e, (1) 

where x € is a random vector and B G is the connectivity matrix with zeros on the diag¬ 
onal (no self-loops). Allowing for self-loops would lead to an identifiability problem, independent 
of the method. See Section in the Appendix for more details on this setting. The graph corre¬ 
sponding to B has p nodes and an edge from node j to node i if and only if TSij ^ 0. The error 
terms e are p-dimensional random variables with mean 0 and positive semi-definite covariance ma¬ 
trix Sg = E{ee^). We do not assume that Sg is a diagonal matrix which allows the existence of 
latent variables. 

The solutions to ([T]i can be thought of as the deterministic equilibrium solutions (conditional on the 
noise term) of a dynamic model governed by first-order difference equations with matrix B in the 
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sense of flo). For well-defined equilibrium solutions of Q, we need that I —B is invertible. Usually 
we also want ([T]) to converge to an equilibrium when iterating as •(— -f e or in other 

words limm_>oo B"* = 0. This condition is equivalent to the spectral radius of B being strictly 
smaller than one fm. We will make an assumption on cyclic graphs that restricts the strength of the 
feedback. Specifically, let a cycle of length 77 be given by (toi, ..., = mi) G {1,... 

and mk ^ mi for 1 < k < i < rj. We define the cycle-product C'P(B) of a matrix B to be the 
maximum over cycles of all lengths 1 < 77 < p of the path-products 


CP(B) := max 

(mi,cycle 

l<77<p 


n iB 




l<k<7) 


( 2 ) 


The cycle-product C'P(B) is clearly zero for acyclic graphs. We will assume the cycle-product to be 
strictly smaller than one for identifiability results, see Assumption (A) below. The most interesting 
graphs are those for which CT’(B) < 1 and for which the spectral radius of B is strictly smaller than 
one. Note that these two conditions are identical as long as the cycles in the graph do not intersect, 
i.e., there is no node that is part of two cycles (for example if there is at most one cycle in the graph). 
If cycles do intersect, we can have models for which either (i) C'P(B) < 1 but the spectral radius 
is larger than one or (ii) C'P(B) > 1 but the spectral radius is strictly smaller than one. Models in 
situation (ii) are not stable in the sense that the iterations will not converge under interventions. We 
can for example block all but one cycle. If this one single unblocked cycle has a cycle-product larger 
than 1 (and there is such a cycle in the graph if CP(B) > 1), then the solutions of the iteration are 
not sta bl^] Models in situation (i) are not stable either, even in the absence of interventions. We 
can still in theory obtain the now instable equilibrium solutions to Q as (I — B) and the theory 
below applies to these instable equilibrium solutions. However, such instable equilibrium solutions 
are arguably of little practical interest. In summary; all interesting feedback models that are stable 
under interventions satisfy both C'P(B) < 1 and have a spectral radius strictly smaller than one. We 
will just assume UPjB) < 1 for the following results. 

It is impossible to learn the structure B of this model from observational data alone without making 
further assumptions. The LiNGAM approach has been extended in im to cyclic models, exploiting 
a possible non-Gaussianity of the data. Using both experimental and interventional data, MM 
could show identifiability of the connectivity matrix B under a learning mechanism that relies on 
data under so-called “surgical” or “perfect” interventions. In their framework, a variable becomes 
independent of all its parents if it is being intervened on and all incoming contributions are thus ef¬ 
fectively removed under the intervention (also called do-interventions in the classical sense of ifTSl I. 
The learning mechanism makes then use of the knowledge where these “surgical” interventions oc¬ 
curred. na also allow for “changing” the incoming arrows for variables that are intervened on; 
but again, M requires the location of the interventions while we do not assume such knowledge. 
US consider a target variable and allow for arbitrary interventions on all other nodes. They neither 
permit hidden variables nor cycles. 

Here, we are interested in a setting where we have either no or just very limited knowledge about 
the exact location and strength of the interventions, as is often the case for data observed under 
different environments (see the example on financial time series further below) or for biological data 
ifTblfTTl . These interventions have been called “fat-hand” or “uncertain” interventions IITSl . While 
m assume acyclicity and model the structure explicitly in a Bayesian setting, we assume that the 
data in environment j are equilibrium observations of the model 

Xj = Bxj -f Cj + Gj, (3) 

where the random intervention shift Cj has a mean and covariance Scj- The location of these 
interventions (or simply the intervened variables) are those components of Cj that are not zero 
with probability one. Given these locations, the interventions simply shift the variables by a value 
determined by c^; they are therefore not “surgical” but can be seen as a special case of what is 
called an “imperfect”, “parametric” CU or “dependent” intervention ll 20 l or “mechanism change” 
ED- The matrix B and the error distribution of ej are assumed to be identical in all environments. 
In contrast to the covariance matrix for the noise term ej, we do assume that Scj is a diagonal 

*The blocking of all but one cycle can be achieved by do-interventions on appropriate variables under the 
following condition: for every pair of cycles in the graph, the variables in one cycle cannot be a subset of the 
variables in the other cycle. Otherwise the blocking could be achieved by deletion of appropriate edges. 


2 



matrix, which is equivalent to demanding that interventions at different variables are uncorrelated. 
This is a key assumption necessary to identify the model using experimental data. Furthermore, we 
will discuss in Section |4~2] how a violation of the model assumption 0 can be detected and used to 
estimate the location of the interventions. 

In Section|^we show how to leverage observations under different environments with different inter¬ 
ventional distributions to learn the structure of the connectivity matrix B in model (|^. The method 
rests on a simple joint matrix diagonalization. We will prove necessary and sufficient conditions for 
identifiability in Section]^ Numerical results for simulated data and applications in flow cytometry 
and financial data are shown in Section|4] 

2 Method 

2.1 Grouping of data 

Let J be the set of experimental conditions under which we observe equilibrium data from 
model 0. These different experimental conditions can arise in two ways: (a) a controlled ex¬ 
periment was conducted where the external input or the external imperfect interventions have been 
deliberately changed from one member of J to the next. An example are the flow cytometry data 
II 22 I discussed in Section [4^ (b) The data are recorded over time. It is assumed that the exter¬ 
nal input is changing over time but not in an explicitly controlled way. The data are grouped into 
consecutive blocks j G of observations, see Section |43] for an example. 

2.2 Notation 

Assume we have rij observations in each setting j G J. Let Xj be the (rij x p)-matrix of obser¬ 
vations from model ([^. For general random variables aj G , the population covariance matrix 
in setting j G J is called Saj = Cov(aj), where the covariance is under the setting j G 
Furthermore, the covariance on all settings except setting j G J is defined as an average over all 
environments except for the j-th environment, {\J\ — l)Sc,-j := J2j'^j\{j} j' • The population 

Gram matrix is defined as Gaj = E{ajaj'^). Let the (p x p)-dimensional Sa,j be the empirical 
covariance matrix of the observations Aj G of variable aj in setting j G J . More precisely, 

let Aj be the column-wise mean-centered version of Aj. Then Sa.j := (rij — 1) ^AjAj. The 
empirical Gram matrix is denoted by Gaj := AjAj. 

2.3 Assumptions 

The main assumptions have been stated already but we give a summary below. 

(A) The data are observations of the equilibrium observations of model ([^. The matrix I — B is 
invertible and the solutions to 0 are thus well defined. The cycle-product (|^ CPjB) is strictly 
smaller than one. The diagonal entries of B are zero. 

(B) The distribution of the noise Oj (which includes the influence of latent variables) and the con¬ 
nectivity matrix B are identical across all settings j G J . In each setting j G J , the interven¬ 
tion shift Cj and the noise Oj are uncorrelated. 

(C) Interventions at different variables in the same setting are uncorrelated, that is Sc,j is an (un¬ 
known) diagonal matrix for all j G J . 

We will discuss a stricter version of (C) in Section]^ in the Appendix that allows the use of Gram 
matrices instead of covariance matrices. The conditions above imply that the environments are 
characterized by different interventions strength, as measured by the variance of the shift c in each 
setting. We aim to reconstruct both the connectivity matrix B from observations in different en¬ 
vironments and also aim to reconstruct the a-priori unknown intervention strength and location in 
each environment. Additionally, we will show examples where we can detect violations of the model 
assumptions and use these to reconstruct the location of interventions. 

2.4 Population method 

The main idea is very simple. Looking at the model (|^, we can rewrite 

(l-B)xj = Cj+ej. (4) 
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The population covariance of the transformed observations are then for all settings j & J given by 

(I-B)Sx,,(I-Bf = Sej+Se. (5) 

The last term Se is constant across all settings j & J (but not necessarily diagonal as we allow 
hidden variables). Any change of the matrix on the left-hand side thus stems from a shift in the 
covariance matrix Sc j of the interventions. Let us define the difference between the covariance 
of c and x in setting j as 

^^Sc j *— Sq j — Sc —jf and ^^Sj. j ;= S^ j — —j- (^) 

Assumption (B) together with (|^ implies that 

(I-B)AS,,,(I-B)^ = ASc,, yj€j. (7) 

Using assumption (C), the random intervention shifts at different variables are uncorrelated and the 
right-hand side in Q is thus a diagonal matrix for all j G J. Let T) C be the set of all 

invertible matrices. We also define a more restricted space which only includes those members 
of T) that have entries all equal to one on the diagonal and have a cycle-product less than one, 

:= |d G W^P : D invertiblej (8) 

X>cp := {d G W^p : D G I? and diag(D) = 1 and CP{1 - D) < l|. (9) 

Under Assumption (A), I — B G Vcp- Motivated by Q, we now consider the minimizer 

D = argminij/gx,^^ ^ L(D'ASx.jD'^), where L{A) := ^ (10) 

is the loss L for any matrix A and defined as the sum of the squared off-diagonal elements. In 
Section we present necessary and sufficient conditions on the interventions under which D = 
I- B is the unique minimizer of ( flOl i. In this case, exact joint diagonalization is possible so that 
L(DASxjD'^) = 0 for all environments j G JZ. We discuss an alternative that replaces covariance 
with Gram matrices throughout in Sectionj^in the Appendix. We now give a finite-sample version. 


2.5 Finite-sample estimate of the connectivity matrix 


In practice, we estimate B by minimizing 
the empirical counterpart of ( [TOl i in two 
steps. First, the solution of the optimiza¬ 
tion is only constrained to matrices in T). 
Subsequently, we enforce the constraint on 
the solution to be a member of Vcp- The 
BACkShift algorithm is presented in Al¬ 
gorithm [T] and we describe the important 
steps in more detail below. 


Algorithm 1 backShift 
Input: Xj Vj G J 

1: Compute ASxj Vj G 
2: D = FFDiag(ASxj) 

3: D = PermuteAndScale(D) 
4: B = I - D 

Output: B 


Steps 1 & 2. First, we minimize the following empirical, less constrained variant of ( [T0| 

D := argmin^/gx, ^ L(D'(ASxjjD'^), (11) 

j&J 

where the population differences between covariance matrices are replaced with their empirical 
counterparts and the only constraint on the solution is that it is invertible, i.e. D G I?. For the 
optimization we use the joint approximate matrix diagonalization algorithm FFDlAG 12^ . 


Step 3. The constraint on the cycle product and the diagonal elements of D is enforced by (a) 
permuting and (b) scaling the rows of D. Part (b) simply scales the rows so that the diagonal 
elements of the resulting matrix D are all equal to one. The more challenging first step (a) consists 
of finding a permutation such that under this permutation the scaled matrix from part (b) will have 
a cycle product as small as possible (as follows from Theorem at most one permutation can lead 
to a cycle product less than one). This optimization problem seems computationally challenging at 
first, but we show that it can be solved by a variant of the linear assignment problem (LAP) (see e.g. 
Erl '), as proven in Theoremj^in the Appendix. As a last step, we check whether the cycle product 
of D is less than one, in which case we have found the solution. Otherwise, no solution satisfying 
the model assumptions exists and we return a warning that the model assumptions are not met. See 
Appendix [B] for more details. 
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Computational cost. The computational complexity of BACKSHIFT is O(| | • n) as computing 

the covariance matrices costs 0(\J\-n-p^), FFDlAG has a computational cost of 0(1^71-p^) and both 
the linear assignment problem and computing the cycle product can be solved in O(p^) time. For 
instance, this complexity is achieved when using the Hungarian algorithm for the linear assignment 
problem (see e.g. Il24l ') and the cycle product can be computed with a simple dynamic programming 
approach. 

2.6 Estimating the intervention variances 

One additional benefit of BACKSHIFT is that the location and strength of the interventions can be 
estimated from the data. The empirical, plug-in version of Eq. Q is given by 

(I-B)^xj(I-B)^ = ^c., VjeTT. (12) 

So the element (AScj)fcfc is an estimate for the difference between the variance of the interven¬ 
tion at variable k in environment j, namely {'Ec,j)kk, and the average in all other environments, 
(Sc,-i)fcfe- From these differences we can compute the intervention variance for all environments 
up to an offset. By convention, we set the minimal intervention variance across all environments 
equal to zero. Alternatively, one can let observational data, if available, serve as a baseline against 
which the intervention variances are measured. 

3 Identifiability 

Let for simplicity of notation, 

■“ {^^c,j)kk 

be the variance of the random intervention shifts Cj at node k in environment j G as per the 
definition of ASc.j in (|^. We then have the following identifiability result (the proof is provided 
in Appendix [a). 

Theorem 1. Under assumptions (A), (B) and ( C), the solution to is unique if and only if for all 
fc, I G {1, ... ,p} there exist j,j' G J such that 

Bj,iVj',k ■ (13) 

If none of the intervention variances rjj ^ vanishes, the uniqueness condition is equivalent to de¬ 
manding that the ratio between the intervention variances for two variables k, I must not stay iden¬ 
tical across all environments, that is there exist j, j' G J such that 

^ ^ (14) 

^3,1 ^3',I 

which requires that the ratio of the variance of the intervention shifts at two nodes k, I is not identical 
across all settings. This leads to the following corollary. 

Corollary 2. (i) The identifiability condition ( |13[ ) cannot be satisfied if\ff\ =2 since then = 

—J 7 j / j./or all k and j ^ f. We need at least three different environments for identifiability. 

( a) The identifiability condition 03 is satisfied for all Iff \ > 3 almost surely if the variances of the 
intervention Cj are chosen independently (over all variables and environments j G fj) from a 
distribution that is absolutely continuous with respect to Lebesgue measure. 

Condition (ii) can be relaxed but shows that we can already achieve full identifiability with a very 
generic setting for three (or more) different environments. 

4 Numerical results 

In this section, we present empirical results for both synthetic and real data sets. In addition to 
estimating the connectivity matrix B, we demonstrate various ways to estimate properties of the 
interventions. Besides computing the point estimate for BACKSHIFT, we use stability selection ESll 
to assess the stability of retrieved edges. We attach R-code with which all simulations and analyses 
can be reproducecO] 

^An R-package called “backshift” is available from CRAN. 
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Figure 1: Simulated data, (a) 


True network. 


(b) Scheme for data generation, (c) 


Performance metrics for the 


settings considered in Section 4.1 For BACKSHIFT, precision and recall values for Settings 1 and 2 coincide. 


Setting 1 

Setting 2 

Setting 3 

Setting 4 

Setting 5 

n = 1000 

n = 10000 

n = 10000 

n = 10000 

n = 10000 

no hidden vars. 

no hidden vars. 

hidden vars. 

no hidden vars. 

no hidden vars. 

mj = 1 

mi = 1 

mj = 1 

mi = 0 

mi = 0.5 



SHD = 17. |f| = 0.91 SHD = 14, |t| = 0.6S SHD = 16, |t| = 0.98 SHE) = 8, |t| = 0.25 SHD = 7, |t| = 0.29 


Figure 2: Point estimates of BACKShift and Ling for synthetic data. We threshold the point estimate of 
BACKSHIFT at f = ±0.25 to exclude those entries which are close to zero. We then threshold the estimate of 
Ling so that the two estimates have the same number of edges. In Setting 4, we threshold Ling at f = ±0.25 
as BACKSHIFT returns the empty graph. In Setting 3, it is not possible to achieve the same number of edges as 
all remaining coefficients in the point estimate of Ling are equal to one in absolute value. The transparency 
of the edges illustrates the relative magnitude of the estimated coefficients. We report the structural Hamming 
distance (SHD) for each graph. Precision and recall values are shown in Figure [l[|c)] 

4.1 Synthetic data 


We compare the point estimate of BACKSHIFT against LiNG ca, a generalization of LiNGAM to 
the cyclic case for purely observational data. We consider the cyclic graph shown in Figure and 
generate data under different scenarios. The data generating mechanism is sketched in Figure ifb) 


Specifically, we generate ten distinct environments with non-Gaussian noise. In each environment, 
the random intervention variable is generated as {cj)k = where /3^,..., /3^ are drawn i.i.d. 

from Exp(m7) and If,... ,1^ are independent standard normal random variables. The intervention 
shift thus acts on all observed random variables. The parameter mj regulates the strength of the 
intervention. If hidden variables exist, the noise term (e^) 7, of variable k in environment j is equal to 
where the weights 71 ,..., 7^ are sampled once from a A/^(0, l)-distribution and the random 
variable LF-' has a Laplace(0,1) distribution. If no hidden variables are present, then {ej)k, k = 
1,... ,p is sampled i.i.d. Laplace(0,1). In this set of experiments, we consider five different settings 
(described below) in which the sample size n, the intervention strength mi as well as the existence 
of hidden variables varies. 


We allow for hidden variables in only one out of five settings as LiNG assumes causal sufficiency 
and can thus in theory not cope with hidden variables. If no hidden variables are present, the pooled 
data can be interpreted as coming from a model whose error variables follow a mixture distribution. 
But if one of the error variables comes from the second mixture component, for example, the other 
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Figure 3: Flow cytometry data, (a) Union of the consensus network (according to 1221 1. the reconstruction by 
I 22 I and the best acyclic reconstruction by 1261 . The edge thickness and intensity reflect in how many of these 
three sources that particular edge is present. |(b)| One of the cyclic reconstructions by l26l . The edge thickness 
and intensity reflect the probability of selecting that particular edge in the stability selection procedure. For 
more details see 1261 . |(c)| BACKSHIFT point estimate, thresholded at ±0.35. The edge intensity reflects the 
relative magnitude of the coefficients and the coloring is a comparison to the union of the graphs shown in 
panels [(^ and |(b)[ Blue edges were also found in 1261 and ca, purple edges are reversed and green edges were 
not previously found in |(a)| or |(b)| |(d)| BACKSHIFT stability selection result with parameters E(U) = 2 and 
TTthr = 0.75. The edge thickness illustrates how often an edge was selected in the stability selection procedure. 


error variables come from the second mixture component, too. In this sense, the data points are not 
independent anymore. This poses a challenge for LiNG which assumes an i.i.d. sample. We also 
cover a case (for mi = 0) in which all assumptions of LiNG are satisfied (Scenario 4). 

Figure]^ shows the estimated connectivity matrices for five different settings and Figure [1[f^ shows 
the obtained precision and recall values. In Setting 1, n = 1000, m/ = 1 and there are no hidden 
variables. In Setting 2, n is increased to 10000 while the other parameters do not change. We observe 
that BACkShift retrieves the correct adjacency matrix in both cases while Ling’s estimate is not 
very accurate. It improves slightly when increasing the sample size. In Setting 3, we do include 
hidden variables which violates the causal sufficiency assumption required for LiNG. Indeed, the 
estimate is worse than in Setting 2 but somewhat better than in Setting 1. BACKSHIFT retrieves 
two false positives in this case. Setting 4 is not feasible for BACKSHIFT as the distribution of the 
variables is identical in all environments (since m/ = 0). In Step 2 of the algorithm, FFDlAG does 
not converge and therefore the empty graph is returned. So the recall value is zero while precision 
is not defined. For LiNG all assumptions are satisfied and the estimate is more accurate than in 
the Settings 1-3. Lastly, Setting 5 shows that when increasing the intervention strength to 0.5, 
BACkShift returns a few false positives. Its performance is then similar to LiNG which returns its 
most accurate estimate in this scenario. The stability selection results for BACKSHIFT are provided 
in Figure]^ in Appendix]^ 

In short, these results suggest that the BACKSHIFT point estimates are close to the true graph if the 
interventions are sufficiently strong. Hidden variables make the estimation problem more difficult 
but the true graph is recovered if the strength of the intervention is increased (when increasing m/ 
to 1.5 in Setting 3, BACKSHIFT obtains a SHD of zero). In contrast, LiNG is unable to cope with 
hidden variables but also has worse accuracy in the absence of hidden variables under these shift 
interventions. 

4.2 Flow cytometry data 

The data published in is an instance of a data set where the external interventions differ be¬ 
tween the environments in J and might act on several compounds simultaneously ifTSll . There are 
nine different experimental conditions with each containing roughly 800 observations which corre¬ 
spond to measurements of the concentration of biochemical agents in single cells. The first setting 
corresponds to purely observational data. 

In addition to the original work by ll22ll . the data set has been described and analyzed in M and 
f2S\ . We compare against the results of ESj . Il22ll and the “well-established consensus”, according 
to 1221, shown in Figures [^fa)] and [^fb)| Figure [^p)| shows the (thresholded) BACKSHIFT point 
estimate. Most of the retrieved edges were also found in at least one of the previous studies. Five 
edges are reversed in our estimate and three edges were not discovered previously. Figure [^fd)| shows 
the corresponding stability selection result with the expected number of falsely selected variables 
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(a) Prices (logarithmic) 



Figure 4; Financial time series with three stock indices: NASDAQ (blue; technology index), S&P 500 (green; 
American equities) and DAX (red; German equities), (a) Prices of the three indices between May 2000 and end 
of 2011 on a logarithmic scale, (b) The scaled log-returns (daily change in log-price) of the three instruments 
are shown. Three periods of increased volatility are visible starting with the dot-com bust on the left to the 
financial crisis in 2008 and the August 2011 downturn, (c) The scaled estimated intervention variance with 
the estimated BACKSHIFT network. The three down-turns are clearly separated as originating in technology, 
American and European equities, (d) In contrast, the analogous Ling estimated intervention variances have a 
peak in American equities intervention variance during the European debt crisis in 2011. 


E(F) = 2. This estimate is sparser in comparison to the other ones as it bounds the number of false 
discoveries. Notably, the feedback loops between PIP2 fa PLCg and PKC fa JNK were also found 
in If26ll . 

It is also noteworthy that we can check the model assumptions of shift interventions, which is impor¬ 
tant for these data as they can be thought of as changing the mechanism or activity of a biochemical 
agent rather than regulate the biomarker directly ll26l . If the shift interventions are not appropri¬ 
ate, we are in general not able to diagonalize the differences in the covariance matrices. Large 
off-diagonal elements in the estimate of the r.h.s in Q indicate a mechanism change that is not just 
explained by a shift intervention as in Q. In four of the seven interventions environments with 
known intervention targets the largest mechanism violation happens directly at the presumed inter¬ 
vention target, see Appendix [C| for details. It is worth noting again that the presumed intervention 
target had not been used in reconstructing the network and mechanism violations. 

4.3 Financial time series 

Finally, we present an application in financial time series where the environment is clearly changing 
over time. We consider daily data from three stock indices NASDAQ, S&P 500 and DAX for a 
period between 2000-2012 and group the data into 74 overlapping blocks of 61 consecutive days 
each. We take log-returns, as shown in panel (b) of Figure]^ and estimate the connectivity matrix, 
which is fully connected in this case and perhaps of not so much interest in itself. It allows us, 
however, to estimate the intervention strength at each of the indices according to ( |T^ , shown in 
panel (c). The intervention variances separate very well the origins of the three major down-turns of 
the markets on the period. Technology is correctly estimated by BACKSHIFT to be at the epicenter 
of the dot-com crash in 2001 (NASDAQ as proxy), American equities during the financial crisis in 
2008 (proxy is S&P 500) and European instruments (DAX as best proxy) during the August 2011 
downturn. 


5 Conclusion 

We have shown that cyclic causal networks can be estimated if we obtain covariance matrices of 
the variables under unknown shift interventions in different environments. BACKSHIFT leverages 
solutions to the linear assignment problem and joint matrix diagonalization and the part of the com¬ 
putational cost that depends on the number of variables is at worst cubic. We have shown sufficient 
and necessary conditions under which the network is fully identifiable, which require observations 
from at least three different environments. The strength and location of interventions can also be 
reconstructed. 
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Appendix to 


BACKSHIFT: Learning causal cyclic graphs from unknown shift 

interventions 


A Identifiability - Proof of Theorem [J 


Proof, “if”: Let D' be a solution of ( [T0| l. Let us write for the m-th row of D' and for 
the m-th row of D, m = 1,... ,p. Furthermore let us dehne m = 1,... ,p. 

We will show that at most one entry of this vector is nonzero. Note that by equation 0 we have 
ASxj = for all j S J. By equation Q, L(DASxjD^) = 0. As D' solves 

equation ( [TOl l, this implies L(D'ASxjD'^) = 0 for all j G J. Hence the offdiagonal elements of 
D'ASxjD''^ are zero, which implies 

gm' -L AScjgm for all m' f m and for all j G J. 

As the gm' are linearly independent, this implies that for all pairs j, / G J, AScjgm and 
AScj'gm are collinear i.e. for all (j,/) there exists a G K such that ASc,jgm = 

or Aj j'ASc^jgm — AScj'^gm 

Take arbitrary k,l G {1,... ,p} and choose j,j' G J such that ( [T3] ) is satisfied. By the argumen¬ 
tation above, there exists a Ajj/ G K such that AScjgm = or Ajj^AScjgm = 

AScj'gm- Without loss of generality let us assume the latter. Recall that both AScj and AScj' 
are diagonal matrices. Now condition ([T^ implies that the A:-th or the Lth entry on the diagonal 
of Xjji AHc,j — AScj' is nonzero (or both). Hence, the /c-th or the Z-th entry of g^ s zero (or 
both). By repeating this argumentation for all k and I, at most one entry of g^ is nonzero. Thus, 
= D^gm = (g^D)^ is a multiple of one of the rows of D. 


By applying this argumentation for all m = 1,... ,p, each row of D' is a multiple of one of the 
rows of D. As both D and D' are invertible, there exists a bijection between the rows of D' and D 
such that the corresponding rows are collinear. Furthermore, the diagonal of D' and D is(l,...,l). 
Hence let us consider a bijection a : {1,... ,p} i—{1,... ,p} such that the cr(m)-th row of D' is 
a multiple of the m-th row of D, i.e. ~ for all m = 1,... ,p. We want to 

show that this bijection is the identity. First observe that, as the diagonal of D' and D is (1,...,1), 
- = T>rn,cT{Tn) foi' all m = 1,... ,p. Now let US consider a cycle in this permutation , i.e. 

mi,..., m,;+i = mi, rj > I, rrii f m^ for \ < l < k < rj and with a{mf) = m^+i for 1 < t < p. 
If this leads to a contradiction, we can conclude that a is the identity. As Dm,m = 1, m 

i-e. 9^0 fori < i < p. This corresponds to a cycle in with product 


n 


DL 




n 


1 


D 


rrit 


(15) 


As D' is a solution of ( [T0| , CP(I — D') < 1, hence the product on the left hand side of equation ( ffS] ) 
is in absolute value strictly smaller than 1, see 0. Analogously, as 7 ^ 0 for c = 1,... ,t], 

the sequence m^+i,mrj,..., mi corresponds to a cycle with product 


n D 


,Tnt+i • 


Using the same argumentation as for D', this product is in absolute value strictly smaller than 1, 
which contradicts Hence such cycles of length > 2 do not exist and tr is the identity. Hence, 


D' = D. 


“only if”: As above define as the m-th row of D and let us write G for the m-th unit 
vector for m = 1,... ,p. Assume that ( [T3| ) is not true, i.e. there exist k,l G {1,... ,p} such that for 
allj,/ e J, 

(AScj)fcfc(ASc,j')ii = (AScj)/i(AScj')fcfe- ( 16 ) 

Without loss of generality let us fix a j' G J with {AUc,j')kk 7^ 0 , and define A := 
{A^c,j' )kk. If such a j' does not exist, we can apply the same argumentation as below 
but with the k and I interchanged and A := 0. 
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Note that the definition of A does not depend on j and that by equation Q we have ASxj = 
ASc,jD“^. Then, for i5 S K we can define DJ,, := D/j, + (5D/, and DJ, := D;, — (5ADfc, 
and we obtain for all j € J 

DJ^ASx.,D;,. = (uz - 5Aufc)^ASc,,(ufc + <5u,) 

= (5(AScj)/i — (5A(AScj)fcfc 

= 0 . 

In the second equation we used ( [T6] l. Furthermore, for small 5 let us scale such that the fc-th 
component of the vector is 1. Analogously, let us scale DJ, such that the Z-th component of the 
vector is 1. Then we can define the matrix D' as the rows of D except for row k and I which are 
replaced by and DJ,. By above reasoning, this matrix satisfies 

D'ASxjD'^ G Diag(p) 

for all j & J and D' is invertible. Furthermore, the diagonal elements of D' are 1. Recall that the 
path-products of I — D over cycles are in absolute value smaller than 1, see Q. For small <5,1 — D' 
is close to I — D (in an arbitrary matrix norm) and hence the path products of I — D' over cycles 
are in absolute value smaller than 1 as well. As D is invertible, D' ^ D. Hence the solution to ( [T0| ) 
is not unique. This concludes the proof. 

□ 


B Polynomial-time algorithm 

Here, we provide the necessary theoretical result to show that BACKSHIFT has a computational cost 
of 0{\J\ ■ n ■ p^). Specifically, we show that Step 3 in Algorithm can be cast in terms of the 
classical linear sum assignment problem, having a computational complexity of 0{p^). 

Theorem 3. Let D G be a matrix with C'F(D) < 1, diagCD) = 1 and Dk,! ^ 0 far 

k,l G {1, ■ ■ • ,p}- For D' G define 

P(D0 :=niD'M|. 

k,l 


Furthermore define 

Vp := {D' : There exists a permutation a of . ,p'\ such that the a{m)-th row ofYT 
is collinear to the m-th row ofT)' and diag(D') = 1 }• 

Then, 

D = arg min P(D') = arg min logP(D'). 

D'G-Dp D'G-Dp 


Proof. Let D' G Vp with D' f D. Let us write D^, for the m-th row of D and analogously 
for the m-th row of D', m = 1,..., p. Now let ct be a permutation such that the cr(m)-th row of D 
is collinear to the m-th row of D'. As D' f D, we have that a f Id. As diag(D') = 1, 




D 


cr(?7i) 


.=D'„ 


It immediately follows that 



V 

P(D) = P(D'). 


As C'P(D) < 1 and a is not the identity, Y[m=i p |Dcr(m),m| < 1- As all elements of D and D' 
are nonzero, P(D) > 0 and P(D') > 0. Hence, P(D') > P(D). This concludes the proof. 

□ 
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Remark: We can define the relative loss function of moving row k to row I as 

^(fc,Z) = -log(|D'J). 

Then the linear assignment problem that minimizes this problem also yields the correct permutation 
for Step 3 in Algorithm [T] if it exists, i.e. the permutation cr on {1,... ,p} that minimizes 

J2Kk,crik)) 

k=l 

satisfies that is collinear to Do.(m).- 

Remark: Allowing for self-loops would lead to an identifiability problem, independent of the 
method. For every model with self-loops and CP < 1 there is a model without self-loops and 
CP < 1 yielding the same observational distribution in equilibrium. The connectivity matrix with¬ 
out self-loops can thus be seen as a representative of a whole class of connectivity matrices that 
allow self-loops. Specifically, if the connectivity matrix with self-loops is B*, define matrix T by 
PermuteAndScale(I — B*) = T(I — B*), where PermuteAndScale() is the operation de¬ 
fined in Step 3 of the BACKSHIFT algorithm. Technically, PermuteAndScale() is only defined 
for matrices that are nonzero outside of the diagonal. Using similar arguments as in Theorem]^ 
PermuteAndScale() can be extended to arbitrary matrices with nonzero diagonal elements. To 
be more precise, there exists a matrix T such that C'P(T(I — B*)) < 1, diag(T(I — B*)) = 1 
and such that T is the product of a diagonal scaling matrix with a permutation matrix. Then define 
B„eiu := I — T(I — B*), Bj^new = Toj and Cj^new = Tcj for all j & J . As T is the product 
of a diagonal scaling matrix with a permutation matrix, assumptions (B) and (C) are still fulfilled 
and ^j^new — (I ^new') {^Bj^new T — (I B ) (©j -j- Cj) — Xj for all J ^ . This 

implies that the two matrices B* with self-loops and 'Rnew without self-loops (since it has zeroes 
on the diagonal by construction) have both CP < 1 and yield the same distribution. 

C Intervention variances and model misspecification 

The method allows to validate and check the assumptions to some extent. This is especially im¬ 
portant in the data of ll22l as pointed out in f2S\ . The interventions can mostly be thought of as 
not changing the concentration of a biochemical agent but rather changing the activity of the agent, 
for example by inhibiting the reactions in which the agent is involved ESj . Under such a mecha¬ 
nism change, it is doubtful whether the interventions are well approximated by our model (|^ with 
independent shift-interventions. We can check the assumptions by the success of the joint diagonal- 
ization procedure. Specifically, we get an empirical version of Q when plugging in the estimators 
and can check whether all off-diagonal elements on the right hand side of Q are small or vanishing. 
We list below results for the seven experimental intervention conditions whose target is well de¬ 
scribed in Il26l . The element on the right-hand side of 0 with the largest absolute value is selected. 
We use now the Gram instead of the covariance matrix to be also sensitive to model-violations of 
the additional assumption (C’), see Section]^ though the results are almost identical whether using 
the Gram or covariance matrix. These large off-diagonal elements indicate a violated mechanism in 
the sense that the model (|^ does not fit very well, because either the interventions have not been of 
the assumed shift-type or the causal mechanism in which the agent is involved has changed under 
the intervention. 


Experiment Reagent 

Intervention 

largest mechanism violation 

3 

Akt-lnhibitor 

inhibits AKT activity 

PLCg O PKA 

4 

G0076 

inhibits PKC activity 

PKC O PIP2 

5 

Psitectorigenin 

inhibits PIP2 abundance 

PIP2 o PKA 

6 

U0126 

inhibits MEK activity 

MEK 4A PKA 

7 

LY294002 

changes P1P2/P1P3 mechanisms 

PKA o JNK 

8 

PMA 

activates PKC activity 

MEK PKA 

9 

/32CAMP 

activates PKA activity 

PKA o PKC 


The table above lists the results for the seven experimental conditions where we know the inter¬ 
vention mechanism, at least approximately. The results are interesting in that the most violated 
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mechanism (the largest entry in the off-diagonal matrix on the right-hand side of the empirical ver¬ 
sion of (0) occurs in 4 of the 7 experimental conditions directly at the intervention target. In 3 of 
these 4 cases, the violated mechanism concerns a relation that has a large entry in the estimated 
connectivity matrix. This corresponds well with the model of activity interventions in ESI . Note 
that we have not made use of the intervention targets in the estimation procedure. The interesting 
point is that we can use the model violations to estimate with some success where the interventions 
occurred. 


D Beyond covariances 


For the method above, we exploit differences in the covariance of observations across different en¬ 
vironments. We can also exploit a shift in the mean of the intervention strength c (and consequently 
in the observations x) when strengthening the condition (C) to (C’). Specifically, we require for (C’) 
that in each environment j (z J the shift in the mean E{cj) equals zero for all variables except at 
most one variable. The variable with a non-zero shift in the mean can change from one environment 
to another. Note that the counterpart of Q when using the Gram matrix instead of the covariance 
matrix reads 

(I-B)Gx,,(I-B)^ = Gc.,+Ge. (17) 

Under the stronger version (C’), the difference across environments of the right-hand side in ( [TT] ) 
is again a diagonal matrix and we can proceed just as above, by replacing the covariance matrices 
with Gram matrices throughout. If the assumption (C’) is satisfied, this allows identifiability of the 
graph in a wider range of settings (Theorem[^can be adapted in a straightforward manner by again 
replacing covariances with Gram matrices) but requires the stricter condition (C’). Since in practice 
it is often unclear whether the stricter condition is approximately true, we work mainly with the 
weaker assumption (C) and exploit only shifts in the covariance matrices. 


E Additional figures 


Setting 1 
n = 1000 
no hidden vars. 
mi = 1 


Setting 2 
n = 10000 
no hidden vars. 
mi — 1 


Setting 3 
n = 10000 
hidden vars. 
mi — 1 


Setting 4 
n = 10000 
no hidden vars. 
mi — 0 


Setting 5 
n = 10000 
no hidden vars, 
mi = 0.5 




Figure 5; Synthetic data. Stability selection results for BACKSHIFT with parameters E(l/) = 2 and iithr = 
0.75. The intensity of the edges illustrates the relative magnitude of the estimated coefficients, the width shows 
how often an edge was selected. The edge from node 6 to node 10 is associated with the smallest coefficient in 
absolute value. It is retained in none of the settings in the stability selection procedure. 
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